Scientific Python antipatterns advent calendar day twenty

For today, a collection of related patterns to do with comparing data points. As a reminder, I’ll post one tiny example per day with the intention that they should only take a couple of minutes to read.

If you want to read them all but can’t be bothered checking this website each day, sign up for the mailing list:

Sign up for the mailing list

and I’ll send a single email at the end with links to them all.

Combinations, permutations and products

Let’s take our familiar list of fruits:

fruits = ["apple", "banana", "orange", "mango"]

and try to do something that requires comparing them in a pairwise fashion. For example, we might want to find pairs of fruits that are the same length. The obvious way to do this is with a nested loop:

for fruit_one in fruits:
    for fruit_two in fruits:
        if len(fruit_one) == len(fruit_two):
            print(fruit_one, fruit_two)
apple apple
apple mango
banana banana
banana orange
orange banana
orange orange
mango apple
mango mango

If we ignore the actual logic and just concentrate on the pairwise comparisons, then the core bit of logic is this:

for fruit_one in fruits:
    for fruit_two in fruits:
        print(fruit_one, fruit_two)
apple apple
apple banana
apple orange
apple mango
banana apple
banana banana
banana orange
banana mango
orange apple
orange banana
orange orange
orange mango
mango apple
mango banana
mango orange
mango mango

This pattern, where every possible pairwise combination of inputs is tried, is called a product by the mathematically inclined. A couple of things are noticable about a product. Firstly, it includes each element paired with itself e.g. apple apple. Secondly, it includes the comparison in both directions, e.g. apple banana then later banana apple.

For many data analysis purposes, we might not need these. The length of a string is always going to be equal to itself, so if we want to remove self-comparisons from our output we might add in a condition:

for fruit_one in fruits:
    for fruit_two in fruits:

        # don't compare with self
        if fruit_one == fruit_two:
            continue
            
        if len(fruit_one) == len(fruit_two):
            print(fruit_one, fruit_two)
apple mango
banana orange
orange banana
mango apple

which will remove self-comparisons from the output. If we again get rid of the length logic and just look at the comparisons:

for fruit_one in fruits:
    for fruit_two in fruits:

        # don't compare with self
        if fruit_one == fruit_two:
            continue
            
        print(fruit_one, fruit_two)
apple banana
apple orange
apple mango
banana apple
banana orange
banana mango
orange apple
orange banana
orange mango
mango apple
mango banana
mango orange

We now have what is mathematically called a permutation - all possible pairs of two different fruits, in every possible order.

To avoid having duplicated pairs in both orders requires a larger change to the code. One way to do it is to switch to using indices, which allows us to start the second loop only after the position of the first:

for i in range(len(fruits)):
    for j in range(i + 1, len(fruits)):
        fruit_one = fruits[i]
        fruit_two = fruits[j]
        print(fruit_one, fruit_two)
apple banana
apple orange
apple mango
banana orange
banana mango
orange mango

This gives us all possible different pairs in only a single order. This is known as a combination.

Both the combination and permutation code is harder to read than our original double loop, and makes the logic harder to understand and more error prone. The reason that we have been noting the mathematical names for these comparisons is because they are also the names of the functions in the standard library that will do the work for us.

For all possible pairs, we have itertools.product:

import itertools
list(itertools.product(fruits, repeat=2))
[('apple', 'apple'),
 ('apple', 'banana'),
 ('apple', 'orange'),
 ('apple', 'mango'),
 ('banana', 'apple'),
 ('banana', 'banana'),
 ('banana', 'orange'),
 ('banana', 'mango'),
 ('orange', 'apple'),
 ('orange', 'banana'),
 ('orange', 'orange'),
 ('orange', 'mango'),
 ('mango', 'apple'),
 ('mango', 'banana'),
 ('mango', 'orange'),
 ('mango', 'mango')]

Above we have done it as a list to show the output, but in real code we can just iterate directly:

# all combinations of two fruits
for fruit_one, fruit_two in itertools.product(fruits, repeat=2):
    if len(fruit_one) == len(fruit_two):
            print(fruit_one, fruit_two)
apple apple
apple mango
banana banana
banana orange
orange banana
orange orange
mango apple
mango mango

If we want to avoid self comparisons, we can just switch to permutations:

# all combinations of two different fruits
for fruit_one, fruit_two in itertools.permutations(fruits, r=2):
    if len(fruit_one) == len(fruit_two):
            print(fruit_one, fruit_two)
apple mango
banana orange
orange banana
mango apple

and to avoid duplicates we use combinations:

# all combinations of two different fruits
for fruit_one, fruit_two in itertools.combinations(fruits, r=2):
    if len(fruit_one) == len(fruit_two):
            print(fruit_one, fruit_two)
apple mango
banana orange

Compared to our manual code, this is easier both to write and to read. It will also make it much more straightforward to change if we decide in the future that we do want to do comparisons in both orders.

Notice that all three of these functions have a second argument that determines how many fruits we want in each collection, called either repeat or r. This makes it very easy to try more combinations without writing extra loops. For example, here are all the unique combinations of three fruits:

for f1, f2, f3 in itertools.combinations(fruits, r=3):
    print(f1, f2, f3)
apple banana orange
apple banana mango
apple orange mango
banana orange mango

One more time; if you want to see the rest of these little write-ups, sign up for the mailing list:

Sign up for the mailing list